Bellabeat¶

1. Summary¶

This project serves as the final milestone for attaining the Google Data Analytics Professional Certificate. It is a case study on Bellabeat, a tech wellness company that manufactures health-focused smart products for women. Bellabeat offers a range of smart devices that collect various health and lifestyle data to empower women with knowledge about their own health and habits. The smart devices work hand in hand with the Bellabeat app to provide users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. Users are thus able to better understand their current habits and make healthy decisions.

The objective of this study is to analyze consumers' usage data on non-Bellabeat smart devices and determine how it could unlock new growth opportunities for Bellabeat. The insights drawn will be used to develop high-level recommendations for Bellabeat's marketing strategy.

In this project, exploratory data analysis (EDA) will be used to investigate trends, patterns, and relationships in the dataset and to derive insights from them. The analysis follows the Ask, Prepare, Process, Analyze, Share, and Act process and is carried out in the Python programming language.

2. Ask Phase¶

2.1 Business Task¶

The aim of this project is to draw insights into how consumers use non-Bellabeat smart devices and to develop high-level recommendations for Bellabeat's marketing strategy, guided by the following questions:

  1. What are some trends in Fitbit smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

Stakeholders

  • Urška Sršen - Bellabeat Cofounder and Chief Creative Officer
  • Sando Mur - Bellabeat Cofounder and key member of Bellabeat executive team
  • Bellabeat Marketing Analytics team

3. Prepare Phase¶

3.1 Dataset used:¶

The Fitbit Fitness Tracker Data from the Kaggle web repository will be used for this analysis.

3.2 Accessibility and privacy of data:¶

The dataset is confirmed to be open-source and licensed under the CC0: Public Domain. The owner has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. The dataset can be copied, modified, distributed, and used for analysis, even for commercial purposes, all without asking permission.

3.3 Information about our dataset:¶

The dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. The records previewed in this notebook cover 31 days, from 4/12/2016 to 5/12/2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.

3.4 Data Organization and verification:¶

The dataset consists of 18 CSV files in total, each containing various health and activity metrics tracked by Fitbit. By elimination, 13 dataframes were removed from the analysis because they either duplicate a larger dataframe, have too small a sample size, or contain data that is not meaningful for the analysis.

Hence, here are the 5 dataframes that will be used for this analysis:

No.  Table Name                Type                 Description
1    dailyActivity_merged      Microsoft Excel CSV  Daily activity over 31 days for 33 IDs, tracking daily steps, distance, intensities, and calories
2    hourlyCalories_merged     Microsoft Excel CSV  Hourly calories burned over 31 days for 33 IDs
3    hourlyIntensities_merged  Microsoft Excel CSV  Hourly total and average intensity over 31 days for 33 IDs
4    hourlySteps_merged        Microsoft Excel CSV  Hourly steps over 31 days for 33 IDs
5    sleepDay_merged           Microsoft Excel CSV  Daily sleep logs for 24 IDs, tracking total sleep records per day, total minutes asleep, and total time in bed

3.5 Data Integrity and limitations:¶

The dataset is limited by its small sample size (30 users), which may not represent the entire population and may render conclusions drawn from the analysis invalid. Furthermore, demographic information such as age, gender, and ethnicity, which is crucial for determining the strategy for Bellabeat's target market, is not provided in the dataset.
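
The sample-size concern can be checked directly by counting the distinct user IDs in each dataframe. The following is a minimal sketch on a made-up miniature of a tracker export; `toy_activity` and its step values are illustrative assumptions, not rows from the Kaggle files.

```python
import pandas as pd

# Hypothetical miniature of a tracker export: one row per user per day.
toy_activity = pd.DataFrame({
    "Id": [1503960366, 1503960366, 4388161847, 8877689391, 8877689391],
    "TotalSteps": [13162, 10735, 9000, 10686, 8064],
})

# Number of distinct users, and how many daily records each one contributed.
n_users = toy_activity["Id"].nunique()
records_per_user = toy_activity.groupby("Id").size()

print("unique users:", n_users)
print(records_per_user)
```

Running the same two calls on each loaded dataframe would show how many users, and how many records per user, stand behind every figure in the analysis.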

4. Process Phase¶

The data wrangling, analysis, and visualisation process will be carried out using the Python programming language.

4.1 Importing the required libraries¶

First, the following libraries are imported for the analysis.

In [1]:
import pandas as pd
import numpy as np 
import seaborn as sns  
import matplotlib.pyplot as plt 
%matplotlib inline
import plotly.express as px 
import datetime as dt

4.2 Importing and previewing the dataframes¶

In [2]:
# Loading the data into the pandas data frame.
daily_activity = pd.read_csv('dailyActivity_merged.csv')
hourly_calories = pd.read_csv('hourlyCalories_merged.csv')
hourly_intensities = pd.read_csv('hourlyIntensities_merged.csv')
hourly_steps = pd.read_csv('hourlySteps_merged.csv')
sleep_day = pd.read_csv('sleepDay_merged.csv')

# Displaying the top 5 rows of each dataset
print('\033[1m' + 'daily_activity' + '\033[0m')
display(daily_activity.head())

print('\033[1m' + 'hourly_calories' + '\033[0m')
display(hourly_calories.head())

print('\033[1m' + 'hourly_intensities' + '\033[0m')
display(hourly_intensities.head())

print('\033[1m' + 'hourly_steps' + '\033[0m')
display(hourly_steps.head())

print('\033[1m' + 'sleep_day' + '\033[0m')
display(sleep_day.head())
daily_activity
Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
0 1503960366 4/12/2016 13162 8.50 8.50 0.0 1.88 0.55 6.06 0.0 25 13 328 728 1985
1 1503960366 4/13/2016 10735 6.97 6.97 0.0 1.57 0.69 4.71 0.0 21 19 217 776 1797
2 1503960366 4/14/2016 10460 6.74 6.74 0.0 2.44 0.40 3.91 0.0 30 11 181 1218 1776
3 1503960366 4/15/2016 9762 6.28 6.28 0.0 2.14 1.26 2.83 0.0 29 34 209 726 1745
4 1503960366 4/16/2016 12669 8.16 8.16 0.0 2.71 0.41 5.04 0.0 36 10 221 773 1863
hourly_calories
Id ActivityHour Calories
0 1503960366 4/12/2016 12:00:00 AM 81
1 1503960366 4/12/2016 1:00:00 AM 61
2 1503960366 4/12/2016 2:00:00 AM 59
3 1503960366 4/12/2016 3:00:00 AM 47
4 1503960366 4/12/2016 4:00:00 AM 48
hourly_intensities
Id ActivityHour TotalIntensity AverageIntensity
0 1503960366 4/12/2016 12:00:00 AM 20 0.333333
1 1503960366 4/12/2016 1:00:00 AM 8 0.133333
2 1503960366 4/12/2016 2:00:00 AM 7 0.116667
3 1503960366 4/12/2016 3:00:00 AM 0 0.000000
4 1503960366 4/12/2016 4:00:00 AM 0 0.000000
hourly_steps
Id ActivityHour StepTotal
0 1503960366 4/12/2016 12:00:00 AM 373
1 1503960366 4/12/2016 1:00:00 AM 160
2 1503960366 4/12/2016 2:00:00 AM 151
3 1503960366 4/12/2016 3:00:00 AM 0
4 1503960366 4/12/2016 4:00:00 AM 0
sleep_day
Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
0 1503960366 4/12/2016 12:00:00 AM 1 327 346
1 1503960366 4/13/2016 12:00:00 AM 2 384 407
2 1503960366 4/15/2016 12:00:00 AM 1 412 442
3 1503960366 4/16/2016 12:00:00 AM 2 340 367
4 1503960366 4/17/2016 12:00:00 AM 1 700 712

4.3 Check the data information¶

Now we will get an overview (number of entries, null values, column names) of the dataframes and check for any incorrect data types.

In [3]:
print('\033[1m' + 'daily_activity' + '\033[0m') 
daily_activity.info()
daily_activity
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        940 non-null    int64  
 1   ActivityDate              940 non-null    object 
 2   TotalSteps                940 non-null    int64  
 3   TotalDistance             940 non-null    float64
 4   TrackerDistance           940 non-null    float64
 5   LoggedActivitiesDistance  940 non-null    float64
 6   VeryActiveDistance        940 non-null    float64
 7   ModeratelyActiveDistance  940 non-null    float64
 8   LightActiveDistance       940 non-null    float64
 9   SedentaryActiveDistance   940 non-null    float64
 10  VeryActiveMinutes         940 non-null    int64  
 11  FairlyActiveMinutes       940 non-null    int64  
 12  LightlyActiveMinutes      940 non-null    int64  
 13  SedentaryMinutes          940 non-null    int64  
 14  Calories                  940 non-null    int64  
dtypes: float64(7), int64(7), object(1)
memory usage: 110.3+ KB
In [4]:
print('\033[1m' + 'hourly_calories' + '\033[0m') 
hourly_calories.info()
hourly_calories
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            22099 non-null  int64 
 1   ActivityHour  22099 non-null  object
 2   Calories      22099 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 518.1+ KB
In [5]:
print('\033[1m' + 'hourly_intensities' + '\033[0m')
hourly_intensities.info()
hourly_intensities
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Id                22099 non-null  int64  
 1   ActivityHour      22099 non-null  object 
 2   TotalIntensity    22099 non-null  int64  
 3   AverageIntensity  22099 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 690.7+ KB
In [6]:
print('\033[1m' + 'hourly_steps' + '\033[0m')
hourly_steps.info() 
hourly_steps
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            22099 non-null  int64 
 1   ActivityHour  22099 non-null  object
 2   StepTotal     22099 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 518.1+ KB
In [7]:
print('\033[1m' + 'sleep_day' + '\033[0m')
sleep_day.info()
sleep_day
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413 entries, 0 to 412
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Id                  413 non-null    int64 
 1   SleepDay            413 non-null    object
 2   TotalSleepRecords   413 non-null    int64 
 3   TotalMinutesAsleep  413 non-null    int64 
 4   TotalTimeInBed      413 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 16.3+ KB

Notice that the data types of the ActivityDate, ActivityHour, and SleepDay columns are in the object format. We will convert them to the date-time format later on (Section 4.4.3).
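
When converting such columns, passing an explicit format string avoids relying on format inference. The sketch below assumes the timestamps follow the month/day/year, 12-hour AM/PM shape seen in the previews; the name `FITBIT_TS_FORMAT` is hypothetical and introduced only for illustration.

```python
import pandas as pd

# Timestamp strings in the same shape as the ActivityHour / SleepDay previews.
raw = pd.Series([
    "4/12/2016 12:00:00 AM",
    "4/12/2016 1:00:00 AM",
    "5/12/2016 2:00:00 PM",
])

# Month/day/year with a 12-hour clock and an AM/PM marker.
# Assumed format, inferred from the previewed rows.
FITBIT_TS_FORMAT = "%m/%d/%Y %I:%M:%S %p"
parsed = pd.to_datetime(raw, format=FITBIT_TS_FORMAT)

print(parsed.dtype)
print(parsed.dt.hour.tolist())
```

With an explicit format, a value that does not match raises an error instead of being parsed under a guessed convention such as day-first.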

4.4 Data cleaning and Manipulation¶

The process involves:

  • Identifying and removing duplicates and nulls
  • Formatting datatypes
  • Renaming columns
  • Sorting

4.4.1 Identifying and dropping duplicates¶

In [8]:
# Identifying number of duplicates in each dataframe
# (print() returns None, so its result is not assigned to a variable)
print("daily_activity=", daily_activity.duplicated().sum())

print("hourly_calories=", hourly_calories.duplicated().sum())

print("hourly_intensities=", hourly_intensities.duplicated().sum())

print("hourly_steps=", hourly_steps.duplicated().sum())

print("sleep_day=", sleep_day.duplicated().sum())
daily_activity= 0
hourly_calories= 0
hourly_intensities= 0
hourly_steps= 0
sleep_day= 3

Found 3 duplicates in the sleep_day dataframe.

In [9]:
# Extracting the duplicated rows in sleep_day dataframe
sleep_day.loc[sleep_day.duplicated(), :]
Out[9]:
Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
161 4388161847 5/5/2016 12:00:00 AM 1 471 495
223 4702921684 5/7/2016 12:00:00 AM 1 520 543
380 8378563200 4/25/2016 12:00:00 AM 1 388 402
In [10]:
#Dropping the duplicates
sleep_day.drop_duplicates()
Out[10]:
Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
0 1503960366 4/12/2016 12:00:00 AM 1 327 346
1 1503960366 4/13/2016 12:00:00 AM 2 384 407
2 1503960366 4/15/2016 12:00:00 AM 1 412 442
3 1503960366 4/16/2016 12:00:00 AM 2 340 367
4 1503960366 4/17/2016 12:00:00 AM 1 700 712
... ... ... ... ... ...
408 8792009665 4/30/2016 12:00:00 AM 1 343 360
409 8792009665 5/1/2016 12:00:00 AM 1 503 527
410 8792009665 5/2/2016 12:00:00 AM 1 415 423
411 8792009665 5/3/2016 12:00:00 AM 1 516 545
412 8792009665 5/4/2016 12:00:00 AM 1 439 463

410 rows × 5 columns

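Note: the sleep_day dataframe started with 413 entries, and the result above has 410 entries after removing the 3 duplicates. However, drop_duplicates() returns a new dataframe and does not modify sleep_day in place; because the result is not assigned back to sleep_day, the 3 duplicates remain in the dataframe used by later cells (the sleep columns in Section 5.1 still report a count of 413).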
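
Because `drop_duplicates()` returns a new dataframe, the removal only persists when the result is assigned back. A minimal sketch on a toy dataframe (`toy_sleep` is a hypothetical name; its values are taken from the duplicated rows shown above):

```python
import pandas as pd

# Toy sleep log containing one exact duplicate row.
toy_sleep = pd.DataFrame({
    "Id": [4388161847, 4388161847, 4702921684],
    "SleepDay": ["5/5/2016 12:00:00 AM", "5/5/2016 12:00:00 AM", "5/7/2016 12:00:00 AM"],
    "TotalMinutesAsleep": [471, 471, 520],
})

# Called on its own, drop_duplicates() returns a copy and leaves the original untouched.
preview = toy_sleep.drop_duplicates()
print(len(preview), len(toy_sleep))

# Assigning the result back is what makes the removal persist.
toy_sleep = toy_sleep.drop_duplicates().reset_index(drop=True)
print(len(toy_sleep), toy_sleep.duplicated().sum())
```

Re-running the duplicate count after the assignment is a cheap way to confirm that the cleaned dataframe is the one later cells will use.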

4.4.2 Identifying and dropping nulls¶

Here we found no nulls within the dataframes, thus the removal of nulls is not needed.

In [11]:
# Total number of null values
print("daily_activity =", daily_activity.isnull().sum().sum())
print("hourly_calories =", hourly_calories.isnull().sum().sum())
print("hourly_intensities =", hourly_intensities.isnull().sum().sum())
print("hourly_steps =", hourly_steps.isnull().sum().sum())
print("sleep_activity =", sleep_day.isnull().sum().sum())
daily_activity = 0
hourly_calories = 0
hourly_intensities = 0
hourly_steps = 0
sleep_activity = 0

4.4.3 Renaming columns and formatting datatypes¶

As identified in Section 4.3, the timestamp columns of the respective dataframes are in the 'object' format. We want to convert them into the 'date-time' format and display the dates as "yyyy-mm-dd". The SleepDay column of the sleep_day dataframe will be split into separate Date and Time columns so that it can be merged with the daily_activity dataframe later.

In [12]:
# Convert to date-time format
daily_activity['ActivityDate'] = pd.to_datetime(daily_activity['ActivityDate'])

hourly_calories['ActivityHour'] = pd.to_datetime(hourly_calories['ActivityHour'])

hourly_intensities['ActivityHour'] = pd.to_datetime(hourly_intensities['ActivityHour'])

hourly_steps['ActivityHour'] = pd.to_datetime(hourly_steps['ActivityHour'])

sleep_day['Date'] = pd.to_datetime(sleep_day['SleepDay'])
sleep_day['Time'] = pd.to_datetime(sleep_day['SleepDay']).dt.time

# Rearranging columns 
sleep_day = sleep_day[['Id','Date','Time','TotalSleepRecords','TotalMinutesAsleep','TotalTimeInBed']]
In [13]:
# Rename ActivityDate column
daily_activity = daily_activity.rename(columns={'ActivityDate': 'Date'})
In [14]:
#Adding DayOfWeek Column 
daily_activity['DayOfWeek'] = pd.to_datetime(daily_activity['Date']).dt.day_name()

#Rearranging the columns of the dataframe so that DayOfWeek follows Date
DayOfWeek = daily_activity['DayOfWeek']
daily_activity = daily_activity.drop(columns=['DayOfWeek'])
daily_activity.insert(loc=2, column='DayOfWeek', value=DayOfWeek)
In [15]:
print('\033[1m' + 'daily_activity' + '\033[0m')
display(daily_activity.head())

print('\033[1m' + 'hourly_calories' + '\033[0m')
display(hourly_calories.head())

print('\033[1m' + 'hourly_intensities' + '\033[0m')
display(hourly_intensities.head())

print('\033[1m' + 'hourly_steps' + '\033[0m')
display(hourly_steps.head())

print('\033[1m' + 'sleep_day' + '\033[0m')
display(sleep_day.head())
daily_activity
Id Date DayOfWeek TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
0 1503960366 2016-04-12 Tuesday 13162 8.50 8.50 0.0 1.88 0.55 6.06 0.0 25 13 328 728 1985
1 1503960366 2016-04-13 Wednesday 10735 6.97 6.97 0.0 1.57 0.69 4.71 0.0 21 19 217 776 1797
2 1503960366 2016-04-14 Thursday 10460 6.74 6.74 0.0 2.44 0.40 3.91 0.0 30 11 181 1218 1776
3 1503960366 2016-04-15 Friday 9762 6.28 6.28 0.0 2.14 1.26 2.83 0.0 29 34 209 726 1745
4 1503960366 2016-04-16 Saturday 12669 8.16 8.16 0.0 2.71 0.41 5.04 0.0 36 10 221 773 1863
hourly_calories
Id ActivityHour Calories
0 1503960366 2016-04-12 00:00:00 81
1 1503960366 2016-04-12 01:00:00 61
2 1503960366 2016-04-12 02:00:00 59
3 1503960366 2016-04-12 03:00:00 47
4 1503960366 2016-04-12 04:00:00 48
hourly_intensities
Id ActivityHour TotalIntensity AverageIntensity
0 1503960366 2016-04-12 00:00:00 20 0.333333
1 1503960366 2016-04-12 01:00:00 8 0.133333
2 1503960366 2016-04-12 02:00:00 7 0.116667
3 1503960366 2016-04-12 03:00:00 0 0.000000
4 1503960366 2016-04-12 04:00:00 0 0.000000
hourly_steps
Id ActivityHour StepTotal
0 1503960366 2016-04-12 00:00:00 373
1 1503960366 2016-04-12 01:00:00 160
2 1503960366 2016-04-12 02:00:00 151
3 1503960366 2016-04-12 03:00:00 0
4 1503960366 2016-04-12 04:00:00 0
sleep_day
Id Date Time TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
0 1503960366 2016-04-12 00:00:00 1 327 346
1 1503960366 2016-04-13 00:00:00 2 384 407
2 1503960366 2016-04-15 00:00:00 1 412 442
3 1503960366 2016-04-16 00:00:00 2 340 367
4 1503960366 2016-04-17 00:00:00 1 700 712

4.5 Merging dataframes¶

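Merging the daily_activity and sleep_day dataframes using the Id and Date columns as join keys. The left join keeps every daily_activity row and fills the sleep columns with NaN where no sleep record exists.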

In [16]:
daily_activity_sleep = daily_activity.merge(sleep_day,on=['Id','Date'],how='left')

display(daily_activity_sleep)
Id Date DayOfWeek TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories Time TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
0 1503960366 2016-04-12 Tuesday 13162 8.500000 8.500000 0.0 1.88 0.55 6.06 0.00 25 13 328 728 1985 00:00:00 1.0 327.0 346.0
1 1503960366 2016-04-13 Wednesday 10735 6.970000 6.970000 0.0 1.57 0.69 4.71 0.00 21 19 217 776 1797 00:00:00 2.0 384.0 407.0
2 1503960366 2016-04-14 Thursday 10460 6.740000 6.740000 0.0 2.44 0.40 3.91 0.00 30 11 181 1218 1776 NaN NaN NaN NaN
3 1503960366 2016-04-15 Friday 9762 6.280000 6.280000 0.0 2.14 1.26 2.83 0.00 29 34 209 726 1745 00:00:00 1.0 412.0 442.0
4 1503960366 2016-04-16 Saturday 12669 8.160000 8.160000 0.0 2.71 0.41 5.04 0.00 36 10 221 773 1863 00:00:00 2.0 340.0 367.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
938 8877689391 2016-05-08 Sunday 10686 8.110000 8.110000 0.0 1.08 0.20 6.80 0.00 17 4 245 1174 2847 NaN NaN NaN NaN
939 8877689391 2016-05-09 Monday 20226 18.250000 18.250000 0.0 11.10 0.80 6.24 0.05 73 19 217 1131 3710 NaN NaN NaN NaN
940 8877689391 2016-05-10 Tuesday 10733 8.150000 8.150000 0.0 1.35 0.46 6.28 0.00 18 11 224 1187 2832 NaN NaN NaN NaN
941 8877689391 2016-05-11 Wednesday 21420 19.559999 19.559999 0.0 13.22 0.41 5.89 0.00 88 12 213 1127 3832 NaN NaN NaN NaN
942 8877689391 2016-05-12 Thursday 8064 6.120000 6.120000 0.0 1.82 0.04 4.25 0.00 23 1 137 770 1849 NaN NaN NaN NaN

943 rows × 20 columns
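
A left join can return more rows than the left dataframe when a join key is repeated on the right side. The sketch below uses toy stand-ins (`activity` and `sleep` are hypothetical, with made-up values) to show the effect and how the `validate` argument of `merge` can guard against it.

```python
import pandas as pd

# Toy stand-ins for daily_activity and sleep_day; the sleep table repeats one (Id, Date) key.
activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "Date": pd.to_datetime(["2016-04-12", "2016-04-13", "2016-04-12"]),
    "TotalSteps": [13162, 10735, 8064],
})
sleep = pd.DataFrame({
    "Id": [1, 2, 2],
    "Date": pd.to_datetime(["2016-04-12", "2016-04-12", "2016-04-12"]),
    "TotalMinutesAsleep": [327, 388, 388],
})

# A left join keeps every activity row, fills NaN where no sleep record exists,
# and repeats an activity row once per matching sleep row.
merged = activity.merge(sleep, on=["Id", "Date"], how="left")
print(len(activity), "->", len(merged))

# validate="one_to_one" makes pandas raise instead of silently adding rows.
try:
    activity.merge(sleep, on=["Id", "Date"], how="left", validate="one_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False
print("join keys unique:", keys_unique)
```

Comparing the row count before and after a merge is a quick sanity check whenever the right-hand table may hold repeated keys.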

Merging the hourly_(Calories, Intensities, Steps) dataframes using the Id and ActivityHour columns as primary keys to form a new dataframe.

In [17]:
#Merge hourly dataframes
hourly_metrics = hourly_calories.merge(hourly_intensities,on=['Id','ActivityHour'],how='inner')\
.merge(hourly_steps,on=['Id','ActivityHour'],how='inner')
In [18]:
#Rename columns
hourly_metrics = hourly_metrics.rename(columns={'ActivityHour': 'DateTime'})
hourly_metrics = hourly_metrics.rename(columns={'StepTotal': 'TotalSteps'})

display(hourly_metrics)
Id DateTime Calories TotalIntensity AverageIntensity TotalSteps
0 1503960366 2016-04-12 00:00:00 81 20 0.333333 373
1 1503960366 2016-04-12 01:00:00 61 8 0.133333 160
2 1503960366 2016-04-12 02:00:00 59 7 0.116667 151
3 1503960366 2016-04-12 03:00:00 47 0 0.000000 0
4 1503960366 2016-04-12 04:00:00 48 0 0.000000 0
... ... ... ... ... ... ...
22094 8877689391 2016-05-12 10:00:00 126 12 0.200000 514
22095 8877689391 2016-05-12 11:00:00 192 29 0.483333 1407
22096 8877689391 2016-05-12 12:00:00 321 93 1.550000 3135
22097 8877689391 2016-05-12 13:00:00 101 6 0.100000 307
22098 8877689391 2016-05-12 14:00:00 113 9 0.150000 457

22099 rows × 6 columns
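
An inner join drops any key that is missing from either side, so it is worth confirming that the merged row count matches the inputs, as it does here (22099 rows before and after). A minimal sketch of that check on toy frames; all names and the two-row data are illustrative assumptions modelled on the previews.

```python
from functools import reduce

import pandas as pd

hours = pd.to_datetime(["2016-04-12 00:00:00", "2016-04-12 01:00:00"])
calories = pd.DataFrame({"Id": [1, 1], "ActivityHour": hours, "Calories": [81, 61]})
intensities = pd.DataFrame({"Id": [1, 1], "ActivityHour": hours, "TotalIntensity": [20, 8]})
steps = pd.DataFrame({"Id": [1, 1], "ActivityHour": hours, "StepTotal": [373, 160]})

frames = [calories, intensities, steps]
keys = ["Id", "ActivityHour"]

# Chain the inner joins over the list of frames.
hourly = reduce(lambda left, right: left.merge(right, on=keys, how="inner"), frames)

# Equal row counts before and after suggest that no hourly record was lost in the joins.
rows_preserved = all(len(frame) == len(hourly) for frame in frames)
print(hourly.shape, rows_preserved)
```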

5. Analyze and Share Phase¶

5.1 Summary Statistics¶

The describe() function provides a holistic overview of the dataframes from which to draw insights for analysis.

In [19]:
#Exclude Id column
cols = set(daily_activity_sleep.columns) - {'Id'}
summary_daily_activity = daily_activity_sleep[list(cols)]
summary_daily_activity.describe()
Out[19]:
ModeratelyActiveDistance Calories LightlyActiveMinutes TotalMinutesAsleep LoggedActivitiesDistance TotalSleepRecords SedentaryActiveDistance VeryActiveDistance TrackerDistance TotalSteps VeryActiveMinutes LightActiveDistance TotalDistance TotalTimeInBed FairlyActiveMinutes SedentaryMinutes
count 943.000000 943.000000 943.000000 413.000000 943.000000 413.000000 943.000000 943.000000 943.000000 943.000000 943.000000 943.000000 943.000000 413.000000 943.000000 943.000000
mean 0.570880 2307.507953 193.025451 419.467312 0.110045 1.118644 0.001601 1.504316 5.488547 7652.188759 21.239661 3.349258 5.502853 458.639225 13.628844 990.353128
std 0.884775 720.815522 109.308468 118.344679 0.622292 0.345521 0.007335 2.657626 3.909291 5086.532832 32.946264 2.046505 3.926509 127.101607 20.000746 301.262473
min 0.000000 0.000000 0.000000 58.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 61.000000 0.000000 0.000000
25% 0.000000 1829.500000 127.000000 361.000000 0.000000 1.000000 0.000000 0.000000 2.620000 3795.000000 0.000000 1.950000 2.620000 403.000000 0.000000 729.000000
50% 0.240000 2140.000000 199.000000 433.000000 0.000000 1.000000 0.000000 0.220000 5.260000 7439.000000 4.000000 3.380000 5.260000 463.000000 7.000000 1057.000000
75% 0.805000 2796.500000 264.000000 490.000000 0.000000 1.000000 0.000000 2.065000 7.715000 10734.000000 32.000000 4.790000 7.720000 526.000000 19.000000 1229.000000
max 6.480000 4900.000000 518.000000 796.000000 4.942142 3.000000 0.110000 21.920000 28.030001 36019.000000 210.000000 10.710000 28.030001 961.000000 143.000000 1440.000000
In [20]:
#Exclude Id column
cols = set(hourly_metrics.columns) - {'Id'}
summary_hourly_metrics = hourly_metrics[list(cols)]

summary_hourly_metrics.describe()
Out[20]:
TotalSteps Calories AverageIntensity TotalIntensity
count 22099.000000 22099.000000 22099.000000 22099.000000
mean 320.166342 97.386760 0.200589 12.035341
std 690.384228 60.702622 0.352219 21.133110
min 0.000000 42.000000 0.000000 0.000000
25% 0.000000 63.000000 0.000000 0.000000
50% 40.000000 83.000000 0.050000 3.000000
75% 357.000000 108.000000 0.266667 16.000000
max 10554.000000 948.000000 3.000000 180.000000
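
Selecting columns through a set does not guarantee their order, which is why the columns in the summaries above appear shuffled. A sketch of an order-preserving alternative using `drop(columns=...)`; `toy_hourly` is a hypothetical frame built from the first five previewed hourly rows.

```python
import pandas as pd

toy_hourly = pd.DataFrame({
    "Id": [1503960366] * 5,
    "Calories": [81, 61, 59, 47, 48],
    "TotalIntensity": [20, 8, 7, 0, 0],
    "TotalSteps": [373, 160, 151, 0, 0],
})

# drop(columns=...) removes Id while keeping the remaining columns in their original order.
summary = toy_hourly.drop(columns="Id").describe()
print(summary.loc[["mean", "50%", "max"]])
```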

5.2 Distribution of the different activity levels¶

Create a distribution of the different activity levels by minutes:

  • Lightly Active Minutes
  • Fairly Active Minutes
  • Very Active Minutes
In [21]:
fig, axes = plt.subplots(1, 3, figsize=(25, 6))

plt.style.use("seaborn-colorblind")

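fig.suptitle("Distribution of Activity Types"
             , fontsize=20, fontweight="bold", y=1.03)

# plt.ylim() is read before anything is plotted, so it returns the default (0.0, 1.0);
# max_ylim is therefore 1.0 and the multipliers used in the text() calls below
# (188 and 645) act as absolute y positions for the mean labels.
min_ylim, max_ylim = plt.ylim()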

# Plot Histogram for Lightly Active Minutes
axes[0].hist(daily_activity_sleep["LightlyActiveMinutes"],
           histtype="bar", bins=10, edgecolor='black')
axes[0].set_xlabel("Lightly Active Minutes", fontsize=15)
axes[0].set_ylabel("No. of Records", fontsize=15)
axes[0].axvline(daily_activity_sleep["LightlyActiveMinutes"].mean()
                , color='red', linestyle='dashed', linewidth=2)
axes[0].text(daily_activity_sleep["LightlyActiveMinutes"].mean()*1.05
           , max_ylim*188, 'Mean: {:.2f}'.format(daily_activity_sleep["LightlyActiveMinutes"].mean()))

# Plot Histogram for Fairly Active Minutes
axes[1].hist(daily_activity_sleep["FairlyActiveMinutes"],
             histtype="bar", color="y", bins=10, edgecolor='black')
axes[1].set_xlabel("Fairly Active Minutes", fontsize=15)
axes[1].set_ylabel("No. of Records", fontsize=15)
axes[1].axvline(daily_activity_sleep["FairlyActiveMinutes"].mean()
                , color='red', linestyle='dashed', linewidth=2)
axes[1].text(daily_activity_sleep["FairlyActiveMinutes"].mean()*1.2
             , max_ylim*645, 'Mean: {:.2f}'.format(daily_activity_sleep["FairlyActiveMinutes"].mean()))

# Plot Histogram for Very Active Minutes
axes[2].hist(daily_activity_sleep["VeryActiveMinutes"],
             histtype="bar", color="g", bins=10, edgecolor='black')
axes[2].set_xlabel("Very Active Minutes", fontsize=15)
axes[2].set_ylabel("No. of Records", fontsize=15)
axes[2].axvline(daily_activity_sleep["VeryActiveMinutes"].mean()
                , color='red', linestyle='dashed', linewidth=2)
axes[2].text(daily_activity_sleep["VeryActiveMinutes"].mean()*1.2
             , max_ylim*645, 'Mean: {:.2f}'.format(daily_activity_sleep["VeryActiveMinutes"].mean()))
Out[21]:
Text(25.487592788971366, 645.0, 'Mean: 21.24')

The histograms above show that the records of 'Lightly Active Minutes' are close to a normal distribution, with more occurrences around the mean. Users also spend most of their active time in the Lightly Active category (examples of activities include gardening and walking) and less time in the Fairly Active and Very Active categories (for example, high-cardio activities such as running). The findings are reasonable given that the average user is likely a non-athlete who uses the device for daily lifestyle activities and to clock occasional mid- to high-intensity activities.
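
Placing a mean label on a histogram is easier when the y position is given as a fraction of the axes height rather than as a hand-tuned count. The sketch below is one way to do that with `Axes.get_xaxis_transform()`; the data is synthetic (not the Fitbit records), and the `Agg` backend is an assumption so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for a minutes column; not the Fitbit data.
rng = np.random.default_rng(0)
minutes = rng.normal(loc=190, scale=60, size=500).clip(min=0)

fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(minutes, bins=10, edgecolor="black")

mean_value = minutes.mean()
ax.axvline(mean_value, color="red", linestyle="dashed", linewidth=2)

# x is given in data units, y as a fraction of the axes height (0 = bottom, 1 = top),
# so the label stays near the top whatever the tallest bar turns out to be.
label = ax.text(mean_value * 1.05, 0.9, f"Mean: {mean_value:.2f}",
                transform=ax.get_xaxis_transform())
ax.set_xlabel("Minutes")
ax.set_ylabel("No. of Records")
```

The same call could be repeated per subplot without choosing a separate y offset for each histogram.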

5.3 Average Time Spent on each activity level¶

In [22]:
#Average of activity levels
average_active_min = daily_activity_sleep[['VeryActiveMinutes', 'FairlyActiveMinutes',
                                               'LightlyActiveMinutes', 'SedentaryMinutes']].mean()
activity_level_min = pd.DataFrame(average_active_min) 
activity_level_min.reset_index(inplace=True)
activity_level_min = activity_level_min.rename(columns = {'index':'ActivityLevel', 0:'AverageMinutes'})

activity_level_min
Out[22]:
ActivityLevel AverageMinutes
0 VeryActiveMinutes 21.239661
1 FairlyActiveMinutes 13.628844
2 LightlyActiveMinutes 193.025451
3 SedentaryMinutes 990.353128
In [23]:
#Plot a piechart to show the distribution of average time spent in each activity level
fig = px.pie(activity_level_min, values='AverageMinutes', names ='ActivityLevel',
             title = "Average total time spent in each activity level")

fig.update_traces(textposition='inside')

On average, users spend about 16.5 hours of their day being sedentary, 3.2 hours being lightly active, 13.6 minutes being fairly active, and 21.2 minutes being very active.

Although users spend about 21 minutes a day on intense activities, a significant part of their day is spent being sedentary. This presents a lifestyle concern that has to be addressed; otherwise health conditions could surface in the long run, which defeats the purpose of owning a health and lifestyle device such as Bellabeat's.
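
The hour figures follow from dividing the average minutes by 60. The short sketch below reproduces that arithmetic from the averages in the table above and also expresses each level as a share of the tracked minutes; the variable names are illustrative.

```python
import pandas as pd

# Average minutes per activity level, copied from the table in this section.
average_minutes = pd.Series({
    "VeryActiveMinutes": 21.239661,
    "FairlyActiveMinutes": 13.628844,
    "LightlyActiveMinutes": 193.025451,
    "SedentaryMinutes": 990.353128,
})

hours = average_minutes / 60                            # minutes -> hours
share = average_minutes / average_minutes.sum() * 100   # percent of tracked minutes

breakdown = pd.DataFrame({"Hours": hours.round(1), "SharePct": share.round(1)})
print(breakdown)
print("tracked minutes per day:", round(average_minutes.sum(), 1))
```

The four categories add up to about 1,218 minutes, less than the 1,440 minutes in a day, so the shares are relative to tracked minutes rather than to a full 24 hours.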

5.4 Average calories burned by day of week¶

In [24]:
daily_activity_sleep.columns
Out[24]:
Index(['Id', 'Date', 'DayOfWeek', 'TotalSteps', 'TotalDistance',
       'TrackerDistance', 'LoggedActivitiesDistance', 'VeryActiveDistance',
       'ModeratelyActiveDistance', 'LightActiveDistance',
       'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes',
       'LightlyActiveMinutes', 'SedentaryMinutes', 'Calories', 'Time',
       'TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed'],
      dtype='object')
In [25]:
sort_days = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
In [26]:
# Select the Calories column before aggregating so that non-numeric columns are not averaged
calories = daily_activity_sleep.groupby("DayOfWeek")['Calories'].mean().reindex(sort_days)
avg_calories_dow = pd.DataFrame(calories)
avg_calories_dow.reset_index(inplace=True)
display(avg_calories_dow)
DayOfWeek Calories
0 Monday 2338.099174
1 Tuesday 2356.013158
2 Wednesday 2302.620000
3 Thursday 2204.297297
4 Friday 2331.785714
5 Saturday 2365.592000
6 Sunday 2263.000000
In [27]:
# plot bar plot for average calories burned by day of week
sns.set_style("darkgrid")
plt.figure(figsize=(8,4))
sns.set_context("notebook")
ax = sns.barplot(data=avg_calories_dow, x="DayOfWeek", y="Calories", ci=None, palette="RdBu")
plt.title("Average calories burned by day of week", fontsize=15, fontweight="bold")
plt.xlabel("")
plt.ylabel("Average Calories Burned", fontsize=15)
plt.xticks(rotation=45)

# display the value on each bar
ax = plt.gca()

for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2, p.get_height(), '%d' % int(p.get_height()), 
            fontsize=12, color='black', ha='center', va='bottom')

Based on the bar chart above, users burned a fairly consistent amount of calories throughout the week, with the lowest average on Thursday. However, the amount of calories a person burns daily varies with gender, age, and lifestyle, and because the dataset lacks this demographic information, the averages do not give a holistic picture. For reference, according to the U.S. Department of Health and Human Services, the average adult woman expends roughly 1,600 to 2,400 calories per day, and the average adult man 2,000 to 3,000 calories per day.

Furthermore, the average sedentary person burns approximately 1,800 calories a day. Thus, the daily mean of 2,307 calories (see Section 5.1) is plausible: the average user spends most of their time being sedentary, while a small subset of very active users could be pulling the mean upward.
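
One way to probe whether a few very active users pull the mean up is to compare the mean and median of per-user averages. In Section 5.1 the daily mean (about 2,308 calories) already sits above the median (2,140). The sketch below shows the per-user version of that comparison on made-up values; `toy` is hypothetical and not taken from the dataset.

```python
import pandas as pd

# Toy daily records: two moderately active users and one very active user (made-up values).
toy = pd.DataFrame({
    "Id": [1, 1, 2, 2, 3, 3],
    "Calories": [1800, 1900, 2000, 2100, 3900, 4100],
})

# Average each user first so that users with many records do not dominate the comparison.
per_user = toy.groupby("Id")["Calories"].mean()

mean_of_users = per_user.mean()
median_of_users = per_user.median()

# A mean well above the median points to a few high values pulling the average up.
print(f"mean={mean_of_users:.0f}, median={median_of_users:.0f}")
```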

5.5 Average calories burned hourly¶

In [28]:
hourly_metrics.head()
Out[28]:
Id DateTime Calories TotalIntensity AverageIntensity TotalSteps
0 1503960366 2016-04-12 00:00:00 81 20 0.333333 373
1 1503960366 2016-04-12 01:00:00 61 8 0.133333 160
2 1503960366 2016-04-12 02:00:00 59 7 0.116667 151
3 1503960366 2016-04-12 03:00:00 47 0 0.000000 0
4 1503960366 2016-04-12 04:00:00 48 0 0.000000 0
In [29]:
df = hourly_metrics.groupby(hourly_metrics["DateTime"].dt.hour)["Calories"].mean()
print(df)
DateTime
0      71.805139
1      70.165059
2      69.186495
3      67.538049
4      68.261803
5      81.708155
6      86.996778
7      94.477981
8     103.337272
9     106.142857
10    110.460710
11    109.806904
12    117.197397
13    115.309446
14    115.732899
15    106.637158
16    113.327453
17    122.752759
18    123.492274
19    121.484547
20    102.357616
21     96.056354
22     88.265487
23     77.593577
Name: Calories, dtype: float64
In [30]:
fig = px.line(hourly_metrics.groupby(hourly_metrics["DateTime"].dt.hour)["Calories"].mean(),
              title="Average of total calories burned hourly", markers=True, y="Calories")
fig.update_layout(xaxis={'range':[0,24]}, xaxis_title="Time of Day(Hour)", yaxis_title="Average of total calories Burned",
                  hoverlabel=dict(
        bgcolor="white",
        font_size=14,
        font_family="Rockwell"
    ))
fig.update_traces(hovertemplate='Time of Day(Hour): %{x} <br> Average of Total Calories Burned: %{y}') 

According to the Sleep Foundation, we burn about 50 calories an hour while sleeping. The overnight values in the graph above are somewhat higher, at roughly 68 to 72 calories per hour between midnight and 4am, but they mark the clear low point of the day. Calories burned then rise gradually from 4am to midday. A slight drop after 12pm was also observed, with the lowest afternoon value at 3pm. This may be due to postprandial somnolence (also known as a food coma), which usually sets in after lunch between 1pm and 3pm and leads to fewer calories burned while users are tired.

Calories burned begin to rise again at 4pm and peak at 6pm, which suggests that users may be choosing these hours to work out or to commute after work or school. There is a significant decrease from the 7pm mark to 11pm, which indicates that most users treat this period as rest time until they are ready for bed.
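
The peak and low hours can be read off programmatically rather than by eye. The sketch below copies the hourly averages printed above into a Series and queries it; the variable names are illustrative.

```python
import pandas as pd

# Hourly averages copied from the output above (index = hour of day).
hourly_avg = pd.Series([
    71.805139, 70.165059, 69.186495, 67.538049, 68.261803, 81.708155,
    86.996778, 94.477981, 103.337272, 106.142857, 110.460710, 109.806904,
    117.197397, 115.309446, 115.732899, 106.637158, 113.327453, 122.752759,
    123.492274, 121.484547, 102.357616, 96.056354, 88.265487, 77.593577,
], index=range(24), name="Calories")

peak_hour = hourly_avg.idxmax()                  # hour with the highest average
lowest_hour = hourly_avg.idxmin()                # hour with the lowest average
afternoon_dip = hourly_avg.loc[12:16].idxmin()   # lowest point between 12:00 and 16:00
largest_drop = hourly_avg.diff().idxmin()        # steepest fall from the previous hour

print(peak_hour, lowest_hour, afternoon_dip, largest_drop)
```

For these values the peak falls at hour 18, the minimum at hour 3, the afternoon low at hour 15, and the steepest hour-to-hour fall at hour 20.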

5.6 Total steps by day of week¶

In [31]:
steps = daily_activity_sleep.groupby("DayOfWeek")['TotalSteps'].mean().reindex(sort_days)
avg_steps_dow = pd.DataFrame(steps)
avg_steps_dow.reset_index(inplace=True)

display(avg_steps_dow)
DayOfWeek TotalSteps
0 Monday 7819.082645
1 Tuesday 8125.006579
2 Wednesday 7559.373333
3 Thursday 7420.682432
4 Friday 7448.230159
5 Saturday 8202.712000
6 Sunday 6933.231405
In [32]:
sns.set_style("darkgrid")
plt.figure(figsize=(8,6))
sns.set_context("notebook")
sns.boxplot(data=daily_activity_sleep, x="DayOfWeek", y="TotalSteps", 
            palette="colorblind", sym="", order=sort_days)
plt.title("Total steps by day of week", fontsize=15, fontweight="bold")
plt.xlabel("")
plt.ylabel("Total Steps", fontsize=15)
plt.xticks(rotation=45)
Out[32]:
(array([0, 1, 2, 3, 4, 5, 6]),
 [Text(0, 0, 'Monday'),
  Text(1, 0, 'Tuesday'),
  Text(2, 0, 'Wednesday'),
  Text(3, 0, 'Thursday'),
  Text(4, 0, 'Friday'),
  Text(5, 0, 'Saturday'),
  Text(6, 0, 'Sunday')])

As observed from the boxplot, users clocked the highest number of steps on Saturdays and the lowest average total steps on Sundays, which is likely a rest day for them. The median number of steps taken varies across the week but is rather consistent, hovering in the 6000-7000 range, while the mean is 7652 steps. A mean above the median suggests that the distribution is right-skewed, with a few high-step days pulling the mean upward.

According to MedicineNet, activity levels are classified by the number of steps taken in a day as follows:

  • Sedentary: Less than 5,000 steps daily
  • Low active: About 5,000 to 7,499 steps daily
  • Somewhat active: About 7,500 to 9,999 steps daily
  • Active: About 10,000 to 12,499 steps daily
  • Highly active: More than 12,500 steps daily

The data above indicates that the average user in this dataset is classified as somewhat active despite spending a significant amount of time being sedentary. MedicineNet also notes that studies have shown improved blood sugar levels, lower blood pressure, and reduced symptoms of depression and anxiety in people who walk between 7,500 and 10,000 steps per day.
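The step thresholds listed above can be expressed as a small helper. This is a sketch only: the function name classify_activity is hypothetical, and the handling of exact boundary values (for example, treating exactly 12,500 steps as highly active) is an assumption, since the source list does not specify it.

```python
def classify_activity(daily_steps):
    """Return the activity level for a daily step count.

    Thresholds follow the MedicineNet list above; boundary handling
    is an assumption of this sketch.
    """
    if daily_steps < 5000:
        return "Sedentary"
    if daily_steps < 7500:
        return "Low active"
    if daily_steps < 10000:
        return "Somewhat active"
    if daily_steps < 12500:
        return "Active"
    return "Highly active"


# Mean daily steps reported in the analysis above.
print(classify_activity(7652))
```

The same function could be applied to the TotalSteps column with `.apply()` to label every daily record rather than only the overall mean.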

5.7 Average steps taken hourly¶

In [33]:
fig = px.line(hourly_metrics.groupby(hourly_metrics["DateTime"].dt.hour)["TotalSteps"].mean(), 
              title="Average of total steps taken hourly", markers=True, y="TotalSteps")
fig.update_layout(xaxis={'range':[0,24]}, xaxis_title="Time of Day", 
yaxis_title="Average of Total Steps Taken")
#print("plotly express hovertemplate:", fig.data[0].hovertemplate)
fig.update_traces(hovertemplate='Time of Day: %{x} <br>Average of Total Steps: %{y}') 

This line chart follows a nearly identical pattern to the average calories line chart (Section 5.5) above. Users generally start their day from 5am onwards and reduce the number of steps taken after 7pm.

5.8 Correlation analysis of calories vs steps¶

In [34]:
px.defaults.template = "ggplot2"
px.defaults.color_continuous_scale = px.colors.sequential.Blackbody
px.defaults.width = 800
px.defaults.height = 600
fig = px.scatter(x=daily_activity_sleep["TotalSteps"], y=daily_activity_sleep["Calories"],
                 title="Correlation between Total Steps and Calories",
                 labels=dict(x="Total Steps",y="Calories"))
fig.update_layout(
    xaxis={'range':[0,32000]})

From the scatter plot, we can observe a positive linear relationship between the two variables: users burned more calories when they took more steps. To support this observation, we can use linregress() to find the r value (Pearson's correlation coefficient), which measures the strength of the linear relationship between the two variables.

In [35]:
from scipy.stats import linregress
xs = daily_activity_sleep["TotalSteps"]
ys = daily_activity_sleep["Calories"]

res = linregress(xs,ys)
print(res)
LinregressResult(slope=0.08402718288211401, intercept=1664.5160890160178, rvalue=0.5929492519076744, pvalue=1.3086580542942342e-90, stderr=0.0037199124194150276, intercept_stderr=34.17491705694571)

As seen from the results, the linear regression has an r value of about 0.59 (approximately 0.6), indicating a moderately strong positive linear relationship between the two variables.

linregress() also provides the regression slope, intercept, p value, and standard error. For our analysis, the slope measures the steepness of the best-fit line: the steeper the line, the larger the change in the y variable for each unit change in the x variable. In this case, each additional step is associated with an average of about 0.08 additional calories burned. The r value of 0.6 should not be taken at face value as proof of a strong relationship between the two variables, because r only measures the strength of a linear relationship.
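To make the slope and r value more concrete, the sketch below reuses the numbers printed in the LinregressResult above. The helper predict_calories and the variable names are illustrative and not part of the original notebook; the prediction is only an estimate from the fitted line.

```python
# Values copied from the LinregressResult printed above.
slope = 0.08402718288211401
intercept = 1664.5160890160178
r_value = 0.5929492519076744

# Coefficient of determination: share of the variance in calories
# that the linear fit on steps accounts for.
r_squared = r_value ** 2


def predict_calories(total_steps):
    """Estimate daily calories from the fitted line for a given step count."""
    return intercept + slope * total_steps


print(f"r squared: {r_squared:.3f}")
print(f"Estimated calories at 7652 steps: {predict_calories(7652):.0f}")
print(f"Extra calories per additional 1000 steps: {slope * 1000:.0f}")
```

Squaring r gives roughly 0.35, so steps account for only about 35% of the variance in calories. At the mean of 7652 steps, the fitted line returns roughly the mean daily calories (about 2307), as expected for a least-squares line, which passes through the point of means.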

5.9 Activity level by distance¶

In [36]:
# Mean of active distance level
activity_level_avg_dist = daily_activity_sleep[['SedentaryActiveDistance','LightActiveDistance',
                                            'ModeratelyActiveDistance','VeryActiveDistance']].mean()
In [37]:
# convert into pandas dataframe
active_distance = pd.DataFrame(activity_level_avg_dist) 
active_distance.reset_index(inplace=True)
active_distance = active_distance.rename(columns = {'index':'ActiveDistanceLevel', 0:'AverageActiveDistance'})

active_distance.head()
Out[37]:
ActiveDistanceLevel AverageActiveDistance
0 SedentaryActiveDistance 0.001601
1 LightActiveDistance 3.349258
2 ModeratelyActiveDistance 0.570880
3 VeryActiveDistance 1.504316
In [38]:
sns.set_style("darkgrid")
plt.figure(figsize=(10,4))
sns.set_context("notebook")

ax = sns.barplot(x="ActiveDistanceLevel", y="AverageActiveDistance", data=active_distance, ci=None, palette="dark")
ax.set(xlabel="",ylabel="Average Distance")
plt.title("Average Distance of Activity Levels",fontsize=20)
plt.xticks(rotation=45)

ax = plt.gca()

for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%f' % float(p.get_height()), 
            fontsize=14, color='black', ha='center', va='bottom')

This bar chart depicts the average distance users clocked at each activity level:

  • Sedentary Active Distance
  • Lightly Active Distance
  • Moderate Active Distance
  • Very Active Distance

The highest distance of 3.35km is clocked at the lightly active level. This further reinforces our assumption in Section 5.2 that users are likely wearing their watches for daily lifestyle activities (e.g., walking, doing chores, gardening). The second-highest distance, 1.5km, is clocked at the very active level. The sedentary level has the lowest distance, which is almost negligible; this makes sense as users are most likely inactive and not moving.

5.10 Average time of sleep activity¶

In [39]:
daily_activity_sleep['AwakeTimeInbed'] = daily_activity_sleep['TotalTimeInBed'] - daily_activity_sleep['TotalMinutesAsleep']

sleep = daily_activity_sleep.groupby("DayOfWeek")[['TotalTimeInBed','TotalMinutesAsleep',
                                                   'AwakeTimeInbed']].mean().reindex(sort_days)

sleep_dow = pd.DataFrame(sleep)
sleep_dow.reset_index(inplace=True)

display(sleep_dow)
DayOfWeek TotalTimeInBed TotalMinutesAsleep AwakeTimeInbed
0 Monday 456.170213 418.829787 37.340426
1 Tuesday 443.292308 404.538462 38.753846
2 Wednesday 470.030303 434.681818 35.348485
3 Thursday 435.800000 402.369231 33.430769
4 Friday 445.052632 405.421053 39.631579
5 Saturday 461.275862 420.810345 40.465517
6 Sunday 503.509091 452.745455 50.763636
In [49]:
sleep_dow.plot(x="DayOfWeek", kind="bar", figsize=(12,6), ylabel="Average of Total Mins")
plt.title("Average Time of Sleep Activity", fontsize=20, fontweight="bold")
plt.xlabel("")
plt.xticks(rotation=45)
plt.ylabel("Average of Total Mins", fontsize=15)
Out[49]:
Text(0, 0.5, 'Average of Total Mins')

Users sleep for a mean of 419.5 minutes (~7 hrs), which is fairly consistent across the week and within the healthy range. The highest mean time asleep was recorded on Sundays (~7.5 hrs) and the lowest on Thursdays (~6.7 hrs).

Comparing this chart with Section 5.6 (Total steps by day of week), we see that the lowest average total steps were also recorded on Sundays. This reinforces our assumption that Sunday is likely a rest day for users.

5.11 Proportion of users with adequate sleep¶

In [50]:
# Categorizing users based on their amount of sleep
def sleep_grp_if(TotalMinutesAsleep): 
    if (TotalMinutesAsleep > 420) :
        return 'Adequate Sleep'
    else:
        return 'Inadequate Sleep'
    
sleep_amt = sleep_day.loc[:, ["Id", "Date", "TotalMinutesAsleep"]].copy()  # copy so the new column is not written to a view
sleep_amt['sleep_type'] = sleep_amt['TotalMinutesAsleep'].apply(sleep_grp_if)
sleep_amt.head()
Out[50]:
Id Date TotalMinutesAsleep sleep_type
0 1503960366 2016-04-12 327 Inadequate Sleep
1 1503960366 2016-04-13 384 Inadequate Sleep
2 1503960366 2016-04-15 412 Inadequate Sleep
3 1503960366 2016-04-16 340 Inadequate Sleep
4 1503960366 2016-04-17 700 Adequate Sleep
In [51]:
# Counting the number of sleep records in each sleep category
# (rename_axis + reset_index(name=...) gives the same column names across pandas versions)
sleep_proportion = (sleep_amt['sleep_type'].value_counts()
                    .rename_axis('sleep_type')
                    .reset_index(name='sleep_type_count'))
display(sleep_proportion)
sleep_type sleep_type_count
0 Adequate Sleep 230
1 Inadequate Sleep 183
In [52]:
#Plotting the piechart 

fig = px.pie(sleep_proportion, values='sleep_type_count', names='sleep_type', title = "Proportion of users by sleep adequacy")

fig.update_traces(textposition="inside", labels=["Adequate Sleep","Inadequate Sleep"],textfont_size=20)

The pie chart shows a fairly balanced split between nights of adequate and inadequate sleep. However, I believe there could be initiatives to encourage more users to get at least 7 hours of sleep.
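The percentages behind the pie chart can be checked directly from the counts in the table above. This is a minimal sketch; the dictionary and variable names are illustrative and not part of the original notebook. Note that each count is a sleep record (one night for one user), not a distinct user.

```python
# Record counts copied from the table above.
sleep_counts = {"Adequate Sleep": 230, "Inadequate Sleep": 183}

total_records = sum(sleep_counts.values())

# Share of all sleep records in each category, as a percentage.
sleep_share = {
    label: 100 * count / total_records
    for label, count in sleep_counts.items()
}

for label, pct in sleep_share.items():
    print(f"{label}: {pct:.1f}%")
```

This gives roughly 55.7% adequate and 44.3% inadequate across the 413 records.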

5.12 Distribution of users' sleep hours¶

In [53]:
# Categorizing users based on sleep hours
def sleep_grp_hrs(TotalMinutesAsleep): 
    if (TotalMinutesAsleep <= 420) :
        return 'Less than 7hrs'
    elif (TotalMinutesAsleep <=540):
        return '7hrs to 9hrs'
    else:
        return 'More than 9hrs'
In [54]:
sleep_distribution = sleep_day.loc[:, ["Id", "Date", "TotalMinutesAsleep"]].copy()  # copy so the new column is not written to a view
sleep_distribution['sleep_grp_hrs'] = sleep_distribution['TotalMinutesAsleep'].apply(sleep_grp_hrs)
sleep_distribution.head()
Out[54]:
Id Date TotalMinutesAsleep sleep_grp_hrs
0 1503960366 2016-04-12 327 Less than 7hrs
1 1503960366 2016-04-13 384 Less than 7hrs
2 1503960366 2016-04-15 412 Less than 7hrs
3 1503960366 2016-04-16 340 Less than 7hrs
4 1503960366 2016-04-17 700 More than 9hrs
In [55]:
# Counting the number of sleep records in each sleep-hours group
# (rename_axis + reset_index(name=...) gives the same column names across pandas versions)
sleep_proportion_hrs = (sleep_distribution['sleep_grp_hrs'].value_counts()
                        .rename_axis('sleep_grp_hrs')
                        .reset_index(name='sleep_count'))

display(sleep_proportion_hrs)
sleep_grp_hrs sleep_count
0 7hrs to 9hrs 191
1 Less than 7hrs 183
2 More than 9hrs 39
In [56]:
X1 = sleep_distribution.loc[sleep_distribution.sleep_grp_hrs == 'Less than 7hrs','TotalMinutesAsleep']
X2 = sleep_distribution.loc[sleep_distribution.sleep_grp_hrs == '7hrs to 9hrs','TotalMinutesAsleep']
X3 = sleep_distribution.loc[sleep_distribution.sleep_grp_hrs == 'More than 9hrs','TotalMinutesAsleep']



plt.figure(figsize=(14,8))
plt.hist(X1, color='r', label='Less than 7hrs', edgecolor='k',alpha=0.7, bins=20)
plt.hist(X2, color='g', label='7hrs to 9hrs', edgecolor='k',alpha=0.7, bins=20)
plt.hist(X3, color='b', label='More than 9hrs', edgecolor='k',alpha=0.7, bins=20)
plt.title('Distribution of Users Sleep Hours', fontsize=20, fontweight="bold")
plt.xlabel('Sleep Time (Minutes)', fontsize=15)
plt.ylabel('Frequency', fontsize=15)

plt.legend()
Out[56]:
<matplotlib.legend.Legend at 0x1a298b97340>

Here, we break down sleep duration in a histogram, which shows that the majority of users get approximately 340-540 minutes (about 5.7-9 hrs) of sleep.

5.13 Correlation matrix of daily activities¶

In [57]:
# Creating a dataframe containing correlation coefficients of variables in daily_activity_sleep
total_corr = daily_activity_sleep[["TotalSteps", "TotalDistance", "LoggedActivitiesDistance","VeryActiveDistance", "ModeratelyActiveDistance", "LightActiveDistance", "SedentaryActiveDistance", "VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActiveMinutes", "SedentaryMinutes", "TotalMinutesAsleep", "TotalTimeInBed", "Calories"]].corr()

# plotting the heatmap
fig, ax = plt.subplots(figsize=(15,8))
sns.heatmap(total_corr, annot=True, fmt = '.2f', cmap="viridis")
plt.title("Correlation Heatmap of daily_activity dataset", fontsize = 25)
Out[57]:
Text(0.5, 1.0, 'Correlation Heatmap of daily_activity dataset')

Finally, we plotted a correlation heatmap to get an overview of the correlations across the variables in the daily_activity_sleep dataframe. Some of the relevant variable pairs identified with a strong correlation (r > 0.6) are:

  • TotalDistance and Calories
  • VeryActiveDistance and VeryActiveMinutes
  • LightlyActiveMinutes and LightActiveDistance
  • FairlyActiveMinutes and ModeratelyActiveDistance
  • (TotalSteps, TotalDistance) and VeryActiveDistance
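Reading strong pairs off a heatmap by eye is error-prone, so the selection could also be done in code. The sketch below is a plain-Python illustration of the idea, not the notebook's method: the helper names and the toy columns are hypothetical, the toy values are made up for demonstration and are not taken from the Fitbit dataset, and filtering on the absolute value of r is an assumption (the list above uses r > 0.6).

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strong_pairs(columns, threshold=0.6):
    """Return (name_a, name_b, r) for each unordered pair with |r| above threshold."""
    pairs = []
    # combinations() yields each unordered pair once, so no pair is listed twice.
    for a, b in combinations(columns, 2):
        r = pearson_r(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, round(r, 2)))
    return pairs


# Toy data for illustration only (not from the Fitbit dataset).
toy_columns = {
    "steps": [1000, 4000, 7000, 10000, 13000],
    "distance_km": [0.7, 2.9, 5.0, 7.2, 9.1],
    "sedentary_min": [900, 1100, 700, 1000, 800],
}

for name_a, name_b, r in strong_pairs(toy_columns):
    print(f"{name_a} and {name_b}: r = {r}")
```

Because each unordered pair is visited once, a pair such as VeryActiveDistance and VeryActiveMinutes would appear only once in the result.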

6. Act Phase¶

6.1 Insights¶

  1. The activity intensity on which users spend the most time each day is the lightly active level, with an average of ~3.2 hrs and the highest distance clocked (3.35km).

  2. Although users spent 21 minutes on average in the Very Active category, 81% of their day is spent being sedentary which highlights a concern.

  3. The average user burns 2307 calories and clocks 7652 steps per day.

  4. Calories burned are highest on Saturdays (2365 calories) and lowest on Thursdays (2204 calories).

  5. The average user burns the most calories between 5pm and 7pm. Calories burned gradually decline from 7pm onwards.

  6. The highest average number of steps (8203 steps) is clocked on Saturdays and the lowest (6933 steps) on Sundays.

  7. The average user begins their day at 5am and clocks the highest number of steps between 5pm and 7pm. Activity gradually declines from 7pm onwards.

  8. There is a moderately strong positive linear relationship (r ≈ 0.59) between total steps clocked and total calories burned.

  9. Users have a consistent sleep schedule, with a mean of 419.5 minutes (~7 hrs) asleep across the week. The highest mean time asleep was recorded on Sundays (~7.5 hrs) and the lowest on Thursdays (~6.7 hrs).

  10. The majority of users get approximately 340-540 minutes (about 5.7-9 hrs) of sleep.

  11. 44.3% of sleep records show inadequate sleep (7 hours or less).

  12. At least 5 relevant pairs of variables are found to have a strong correlation (r > 0.6).

6.2 Recommendations¶

Demographic information¶

Some of these market segments could consist of:

  1. Age group
  2. Lifestyle
  3. Hobbies
  4. Fitness goals
  5. Working hours
  6. BMI

The information above would allow the Bellabeat app to provide a suitable program for each user. It would also play a crucial role in identifying how the products are received in different segments, which would ultimately influence how Bellabeat drives its marketing campaigns.

A Friendly Reminder¶

Bellabeat could allow users to configure app and device settings that serve as reminders and motivation to achieve their desired lifestyle goals. For example, a notification via the app could remind users to get active or to keep a consistent bedtime routine. Push notifications could also provide positive reinforcement by showing users the progress they have achieved throughout the day or week, based on the data collected.

Provide useful health articles¶

Bellabeat could incorporate breathing or mindfulness functions in the app to help users reduce their anxiety and stress levels before bedtime. These functions could also be linked to notifications that remind users to practice breathing and mindfulness activities before their scheduled bedtime. A follow-along video could demonstrate breathing and relaxation techniques that would improve one's sleep quality.